ABSTRACT

A method is proposed to assess the importance of differential item
functioning (DIF) by estimating the largest possible fraction of the
population in which DIF does not occur, or equivalently, the smallest
possible portion of the population in which DIF may occur. The approach
is based on latent class (C. C. Clogg, 1981) or mixture concepts, and
was proposed by T. Rudas, C. C. Clogg, and B. G. Lindsay (1994) in the
context of assessing the fit of an arbitrary model to a contingency
table. Application of this procedure produces an estimate of the
minimum proportion of the population that would have to be removed to
make the rest of the population free from DIF, as well as information
about the portion of the population that is the source of DIF. Simple
methods for maximum likelihood estimation are described. Numerical
results are presented for a simulated data set and actual data from the
1993 Advanced Placement Physics examination. (Contains 3 tables and 27
references.)

Estimating the importance of differential item functioning

Several methods have been proposed to detect differential item
functioning (DIF), an item response pattern in which members of
different demographic groups have different conditional
probabilities of answering a test item correctly, given the same
level of ability. In this paper, the mixture index of fit, proposed
by Rudas, Clogg, and Lindsay (1994), is used to estimate the fraction
of the population for which DIF occurs, and this approach is
compared to the Mantel-Haenszel (1959) test of DIF developed by
Holland (1985; see Holland & Thayer, 1988). The proposed estimation
procedure, which is noniterative, can provide information about
which portions of the item response data appear to be contributing

Introduction

The absence of differential item functioning (DIF) is regarded as an
important aspect of test fairness by most educational researchers. 
The extensive literature on the detection and measurement of DIF is 
reviewed in Holland and Wainer (1993) and Camilli and Shepard 
(1994). 

In this paper we propose to assess the importance of differential 
item functioning by estimating the largest possible fraction of the 
population in which DIF does not occur, or, equivalently, the 
smallest possible portion of the population in which DIF may occur. 
This approach is based on latent class (see Clogg, 1981) or mixture
concepts and was proposed by Rudas, Clogg, and Lindsay (1994) in
the general context of assessing the fit of an arbitrary model to a
contingency table.

Let H be any model or hypothesis for a contingency table. Then any
distribution P can be represented as

(1)    $P = (1-\pi)\Phi + \pi\Psi$,

where $\Phi$ is a distribution in H, $\Psi$ is an arbitrary distribution, and
$0 \le \pi \le 1$. The above representation is not unique. The mixture index of
fit $\pi^*$ is defined as the minimum possible value of $\pi$,

       $\pi^* = \inf\{\pi : P = (1-\pi)\Phi + \pi\Psi,\ \Phi \in H\}$,

and it is the smallest possible fraction of the population outside
the model of interest, H. Rudas, Clogg, and Lindsay (1994) described
a general method of obtaining maximum likelihood estimates of $\pi^*$ and
of constructing confidence intervals. The nonrestricted
distribution, $\Psi$, describes residuals, though not in the standard
sense, and $\pi^*$ is the total weight of these residuals. Ordinarily,
residuals are defined with respect to a model that is assumed to
hold in the entire population. By contrast, the residuals in this
approach are defined in the context of representation (1), which is
always true. The residuals $\Psi$ describe the distribution in the part
of the population in which hypothesis H is not true. Various
interpretations of $\pi^*$ are discussed in Clogg, Rudas, and Xi (1995).
In the present paper, the residuals will be used to identify parts
of the population in which evidence of DIF exists.

An extension of the approach of Rudas, Clogg, and Lindsay (1994)
will be used to compare the fits of nested models using a measure of
the relative fit of a model against a restricted alternative (see
also Clogg, Rudas, & Xi, 1995). This will be applied to the "no DIF"
and "uniform DIF" (see Mellenbergh, 1982; Holland, 1985) hypotheses
of the Mantel-Haenszel (MH) type.

Application of the procedure proposed in this paper produces an 
estimate of the minimum proportion of the population that would have 
to be removed in order to make the rest of the population free from 
DIF, as well as information about the specific portion of the 
population that is the apparent source of DIF in the above sense. 
This type of result may be more interpretable than conventional DIF 
statistics and may provide information that can be used to modify
test items. 

The paper is organized as follows: the next section formulates the 
hypotheses of no DIF and uniform DIF as MH-type hypotheses for a 
three-dimensional contingency table. Then simple methods for maximum
likelihood (ML) estimation of $\pi^*$ under these hypotheses will be
described, along with a method for testing the hypothesis that the
fraction of the population that is free from DIF is greater than a
specified value. The conclusions that can be drawn from inspecting
the $\pi^*$ values and $\Psi$ residuals will also be discussed. The next
section will present numerical results for two data sets: a
simulated data set and a set of examinee responses to the 1993
Advanced Placement Physics B Exam. The last section discusses
relative advantages and disadvantages of using the mixture index of
fit $\pi^*$ in this context.

The hypothesis of no differential item functioning 

Let A and B be two groups of respondents, often labeled as the focal 
and reference groups. The focal group is the group of primary 
interest and the reference group serves as a basis for comparison. 

The analysis of DIF can be conducted by comparing the reference and 
focal group odds of answering the item correctly, conditional on a 
measure of ability, such as a test score. Under the hypothesis of no 
DIF, group membership and item response (correct or incorrect) are 
conditionally independent, given ability. The following table gives 
the notation for the conditional probabilities at level j of the 
matching test score:

                     Response
                Correct      Incorrect

Group    A      $p_{ACj}$    $p_{AIj}$
         B      $p_{BCj}$    $p_{BIj}$

The hypothesis of no DIF is

(2)    H(no DIF): $\dfrac{p_{ACj}\, p_{BIj}}{p_{AIj}\, p_{BCj}} = 1$,  $j = 1, \ldots, J$.

Holland (1985) suggested the use of the Mantel-Haenszel procedure
for testing the hypothesis of no DIF (see also Holland & Thayer,
1988). The Mantel and Haenszel (1959) chi-square test approximates
the uniformly most powerful unbiased test of the null hypothesis
against the alternative that the conditional odds ratios (see Rudas
& Leimer, 1992) in (2) are all equal to a common value other than
one (Holland & Thayer, 1988), which is the hypothesis of uniform
DIF:

(3)    H(uniform DIF): $\dfrac{p_{ACj}\, p_{BIj}}{p_{AIj}\, p_{BCj}} = \alpha\ (\neq 1)$  for all $j$.

The amount of DIF, as measured by the conditional odds ratio, is
assumed to be constant over all levels of the matching variable.

When the sample size of the focal group is much smaller than the 
sample size of the reference group, the method for fitting the same 
log-linear model to two groups of very different sizes described in 
Rudas (1991) may be applied instead of testing (2) against (3). 

Holland and Thayer (1988) discussed the relative advantages of 
testing (2) against (3) over other methods of testing for the 
presence of DIF (see also Zwick, 1990). They proposed the use of a 
transformation of the Mantel and Haenszel (1959) odds ratio 
estimator (i.e., the estimator under (3)) to measure the amount of
DIF. In practice, a combination of the MH chi-square and odds ratio
estimate is often used to assess the degree of DIF in an item (see
Zieky, 1993).



In the next section we provide an alternative way of assessing the 
amount of DIF by estimating the smallest fractions of the population 
that have the property that their complements can be described by 
hypotheses (2) and (3), respectively. The comparison of these two 
fractions can be used as a measure of the relative fits of 
hypotheses (2) and (3). 

The hypotheses considered in this section can be extended to items 
with more than two possible scores, such as partial credit items or 
items that are scored on an ordinal scale. Within each level of the 
matching variable, the data can be represented as a 2xL table, where 
L is the number of options. In this case, the association structure 
can be described by considering either the conditional means of the 
two groups (e.g., see Zwick, Donoghue, & Grima, 1993a; Zwick &
Thayer, in press) or the set of conditional odds ratios pertaining
to these 2xL tables (Zwick, Donoghue, & Grima, 1993b). These can be
the odds ratios based on neighboring columns (see Goodman, 1979) or
on the reference cell approach (see Rudas, 1991). The methodology
discussed in the next section can be applied in these cases as well,
but iterative procedures are needed for fitting the models of no DIF
or uniform DIF. A fourth, unobserved variable is introduced, showing
whether or not an observation came from the part of the population
in which the hypothesis holds. Then the EM algorithm (Dempster,
Laird, & Rubin, 1977) can be applied to fit the mixture in (1) with
various trial values of $\pi$. The value of $\pi^*$ is the smallest value
with which perfect fit can be achieved. This procedure is described
in Rudas, Clogg, and Lindsay (1994) in a general form and will not
be discussed here. On the other hand, when the responses are
classified only as correct or incorrect, the ML estimate for $\pi^*$ has
a closed form for the hypothesis of no DIF and can be obtained as
the result of a finite-step maximization procedure for the
hypothesis of uniform DIF. These procedures are considered in the
next section.



Estimating the fraction of population outside of the hypotheses of 
no DIF and uniform DIF 



The goal of the $\pi^*$ approach, sketched briefly in the introduction,
is to consider the observed table of frequencies and take away the
smallest possible fraction of observations, so that what remains
corresponds to the hypothesis of interest exactly. Note that the
resulting exact correspondence to the hypothesis does not imply
that the procedure overfits the model; rather, it is a consequence of
the fact that representation (1) always holds true with an
appropriate value of $\pi$. The ratio of the number of observations
removed to the sample size is the ML estimate of the mixture index
of fit $\pi^*$, and the distribution of the portion of observations that
was taken away is the ML estimate of $\Psi$, where $\Psi$ is the distribution
in that part of the population in which the hypothesis of interest
does not hold (Rudas, Clogg, & Lindsay, 1994).



In the case of model (2), this leads to the following algorithm. For
every level j of the matching variable, consider the table of
observed frequencies and suppose that none of the entries are equal
to zero. If the observed conditional odds ratio

       $\alpha_j = \dfrac{f_{ACj}\, f_{BIj}}{f_{AIj}\, f_{BCj}}$

is greater than $\alpha = 1$, only the smaller of $f_{ACj}$ and $f_{BIj}$ needs to be
reduced. The smaller of these, $g_{smj} = \min(f_{ACj}, f_{BIj})$, must be reduced
by

       $d_j = g_{smj}\,(1 - 1/\alpha_j)$,

which is the amount that brings the conditional odds ratio down to
one. When $\alpha_j$ is less than $\alpha = 1$, only the smaller of $f_{AIj}$ and $f_{BCj}$
needs to be reduced. The smaller of these, $h_{smj} = \min(f_{AIj}, f_{BCj})$, must
be reduced by

       $d_j = h_{smj}\,(1 - \alpha_j)$.

See Clogg, Rudas, and Xi (1994) for related discussion. The ML
estimate of $\pi^*$ for model (2) can be obtained as

       $\pi^*(\text{no DIF}) = \dfrac{1}{N} \sum_{j=1}^{J} d_j$,

where N is the total sample size.

When $f_{ACj} = f_{BIj}$ or $f_{AIj} = f_{BCj}$, either one of the frequencies can be
reduced. The cell of the conditional table in which the frequency is
reduced is not uniquely defined, but the amount of decrease, and
therefore the value of $\pi^*$, are uniquely defined.

To design a simple algorithm yielding $\pi^*$(uniform DIF), consider (3)
as the union of infinitely many hypotheses:

       $H(\text{uniform DIF}) = \bigcup_{\alpha \in \mathbb{R}^+} H_\alpha = \bigcup_{\alpha \in \mathbb{R}^+} \left\{ \dfrac{p_{ACj}\, p_{BIj}}{p_{AIj}\, p_{BCj}} = \alpha \text{ for all } j \right\}$.

Thus $\pi^*$ can be obtained for hypothesis (3) by first fixing $\alpha$,
finding $\pi^*(\alpha)$, and then taking the infimum over the possible
values of $\alpha$. Note that $H_\alpha$ is a prescribed conditional
interaction model in the three-way table (see Rudas, 1991).

For arbitrary but fixed $\alpha$, the algorithm to find $\pi^*(\alpha)$ is exactly
like the one described above for hypothesis (2). This yields a $\pi^*(\alpha)$
value and the ML estimate under hypothesis (3) can be obtained as

(4)    $\pi^*(\text{uniform DIF}) = \inf_{\alpha} \pi^*(\alpha)$.

There is, however, no need to minimize over all positive $\alpha \neq 1$ values.
It can be assumed without loss of generality that the ability levels
are indexed by j in ascending order of the observed conditional odds
ratios, that is, $\alpha_j \le \alpha_{j+1}$ for every j. If for some j,
$\alpha_j < \alpha < \alpha_{j+1}$, then

       $N\,\pi^*(\alpha) = \sum_{i \le j} h_{smi}\,(1 - \alpha_i/\alpha) + \sum_{i \ge j+1} g_{smi}\,(1 - \alpha/\alpha_i)$.

All quantities in this expression are positive, and the second
derivative with respect to $\alpha$ is negative, implying that in the
interval from $\alpha_j$ to $\alpha_{j+1}$ the minimum can occur only for $\alpha = \alpha_j$
or for $\alpha = \alpha_{j+1}$. Also, the minimum in (4) cannot occur for an $\alpha$ value
outside of the range of the observed values, because for $\alpha < \alpha_1$,
$\pi^*(\alpha) > \pi^*(\alpha_1)$, and for $\alpha > \alpha_J$, $\pi^*(\alpha) > \pi^*(\alpha_J)$. Therefore, it suffices to
inspect only the values of $\pi^*(\alpha)$ at the conditional odds ratios
observed at the ability levels:

       $\pi^*(\text{uniform DIF}) = \min_{\alpha = \alpha_1, \ldots, \alpha_J} \pi^*(\alpha)$.

Note that the estimates for $\pi^*$ do not depend on the sample size as
do the chi-square values for the hypotheses of independence or
conditional independence. If two samples have the same relative
frequencies, the estimates of the mixture index of fit $\pi^*$ are the
same.
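
The two procedures described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`odds_ratio`, `removed_count`, `pi_star`, `pi_star_no_dif`, `pi_star_uniform_dif`) and the tuple layout of each conditional 2x2 table are hypothetical choices made for this example, and all cell frequencies are assumed to be strictly positive.

```python
from typing import Sequence, Tuple

# Hypothetical layout of one conditional 2x2 table: (f_AC, f_AI, f_BC, f_BI),
# i.e. group A correct/incorrect followed by group B correct/incorrect.
Table = Tuple[float, float, float, float]


def odds_ratio(t: Table) -> float:
    """Observed conditional odds ratio f_AC*f_BI / (f_AI*f_BC)."""
    f_ac, f_ai, f_bc, f_bi = t
    return (f_ac * f_bi) / (f_ai * f_bc)


def removed_count(t: Table, alpha: float) -> float:
    """Smallest count to remove from one table so its odds ratio equals alpha."""
    f_ac, f_ai, f_bc, f_bi = t
    a_j = odds_ratio(t)
    if a_j > alpha:
        # Odds ratio too large: shrink the smaller numerator cell.
        return min(f_ac, f_bi) * (1.0 - alpha / a_j)
    if a_j < alpha:
        # Odds ratio too small: shrink the smaller denominator cell.
        return min(f_ai, f_bc) * (1.0 - a_j / alpha)
    return 0.0


def pi_star(tables: Sequence[Table], alpha: float) -> float:
    """Fraction of all observations removed when the common odds ratio is alpha."""
    n = sum(sum(t) for t in tables)
    return sum(removed_count(t, alpha) for t in tables) / n


def pi_star_no_dif(tables: Sequence[Table]) -> float:
    """Estimate under hypothesis (2): common odds ratio fixed at one."""
    return pi_star(tables, 1.0)


def pi_star_uniform_dif(tables: Sequence[Table]) -> Tuple[float, float]:
    """Estimate under hypothesis (3): search only the observed odds ratios."""
    candidates = [odds_ratio(t) for t in tables]
    best = min(candidates, key=lambda a: pi_star(tables, a))
    return pi_star(tables, best), best


if __name__ == "__main__":
    # Made-up frequencies, used only to exercise the functions.
    toy = [(30, 20, 20, 30), (40, 10, 30, 20), (10, 40, 10, 40)]
    print(pi_star_no_dif(toy))
    print(pi_star_uniform_dif(toy))
```

Because the search in `pi_star_uniform_dif` is restricted to the observed conditional odds ratios, the work grows with the square of the number of ability levels, and no iterative optimizer is needed.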

The above algorithms assume that there are no zero observed 
frequencies in the data. If, for a given level of the matching 
variable, zeros occur in both cells of the same column (i.e. either 
everybody in both groups, or nobody in either group could answer the 
item correctly), this can be regarded as inconsistent with DIF; 
these 2x2 tables may be omitted from the analysis. Two other ways to 
eliminate zero cells which may be appropriate in some instances are 
combining the data across two or more levels of the matching 
variables (Donoghue & Allen, 1993) or smoothing the data by using a 
suitable prior or by adding small constants to the empty cells 
(Agresti, 1990). 

Once $\pi^*$(no DIF) and $\pi^*$(uniform DIF) have been estimated, several
inferential procedures are feasible. These parameters can be
interpreted as the smallest possible fractions of the population that
cannot be described by the model. The values of $\pi^*$ can be used as
measures of the misfit of the respective models, i.e., as measures of
the amount of DIF. Also, these measures can be compared across items.

The pattern of the residual $\Psi$, i.e., the locations and relative
sizes of the amounts that were removed from the conditional tables,
provides information about where (in terms of ability level, group
membership, and item response) DIF occurs.

If the hypothesis of uniform DIF is extended to include the case of
$\alpha = 1$, then hypothesis (2) is nested in hypothesis (3) and
$\pi^*(\text{no DIF}) \ge \pi^*(\text{uniform DIF})$. The difference between these two
values can be used as a measure of how much better (3) fits the data
than (2) does, i.e., what fraction of the population is lost by
restricting the value of the common conditional odds ratio to one.

The above inferential procedures are illustrated in the next 
section. 

In some cases, testing the hypothesis that the proportion of the
population in which DIF is present is less than a specific value,
say, $\pi$, may be of interest. This can be done by fitting the model

       $P = (1-\pi)\Phi + \pi\Psi$,  $\Phi \in$ H(no DIF)

to the data. To fit this model, standard latent class techniques can
be used, which involve defining a fourth, unobserved, variable that
identifies whether an observation came from the distribution $\Phi$ or
from the distribution $\Psi$, and applying the EM algorithm (Dempster,
Laird, & Rubin, 1977). Details of this procedure and properties of
the resulting chi-square statistic are described in Rudas, Clogg,
and Lindsay (1994).



Examples 

The first example is based on simulated data from a previous study 
(Zwick, Thayer, & Wingersky, 1994). The data consist of the item 
responses of 500 reference group (A) and 500 focal group (B) 
members. The reference group ability distribution was standard
normal N(0, 1), while the focal group distribution was N(0.5, 1).

The item responses were generated using a three-parameter logistic
model (Birnbaum, 1968),

(5)    $P(\theta) = c + \dfrac{1 - c}{1 + \exp(-1.7a(\theta - b))}$,

where $P(\theta)$ is the probability of answering the item correctly for an
examinee with ability $\theta$. The item used in the example had a lower
asymptote of c = 0.15 and a discrimination of a = 1 in both groups. The
reference group difficulty was b = 0 and the focal group difficulty
was b = 0.35. The item response functions for the reference and focal
groups differed only in location; conditional on ability, the item
was more difficult for the focal group. The measure of ability that
served as a matching variable was the number-correct score on a
75-item test that included the example item.
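
As a rough illustration of the data-generating model in (5), the following Python sketch evaluates the item response function and simulates responses to the studied item alone. It is an assumption-laden sketch rather than the original simulation: the function names `p_correct` and `simulate_group` and the random seed are hypothetical, and the remaining items of the 75-item matching test are not simulated.

```python
import math
import random
from typing import List


def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """Three-parameter logistic item response function, as in (5)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def simulate_group(n: int, mean: float, b: float, rng: random.Random,
                   a: float = 1.0, c: float = 0.15) -> List[int]:
    """Draw n abilities from N(mean, 1) and simulate 0/1 responses to one item."""
    responses = []
    for _ in range(n):
        theta = rng.gauss(mean, 1.0)
        responses.append(1 if rng.random() < p_correct(theta, a, b, c) else 0)
    return responses


rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
reference = simulate_group(500, mean=0.0, b=0.0, rng=rng)   # group A
focal = simulate_group(500, mean=0.5, b=0.35, rng=rng)      # group B
print(sum(reference), sum(focal))  # numbers of correct responses
```

At any fixed ability the focal-group probability is lower than the reference-group probability, which is the uniform shift in location described above.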

For this analysis, the data can be summarized in a 76x2x2
contingency table. The sufficient statistics for $\pi^*$ under (2) or (3)
are the 76 (j = 0, ..., 75) observed conditional odds ratios ($\alpha_j$) and
the frequencies $g_{smj}$ and $h_{smj}$.

Out of the 304 observed frequencies, 103 were equal to zero; i.e.,
over one third of the cells were empty. Moreover, out of the 76
conditional 2x2 tables, 45 contained at least one zero frequency;
therefore more than half of the 76 conditional odds ratios were
impossible to estimate from the data or yielded estimated values of
zero. Eliminating the 2x2 tables that contained empty cells would
have required the deletion of 351 observations (over one third of
the sample), which would not have been desirable.

To overcome the problem of empty cells we replaced the zero
frequencies with small positive values. To assess the effect of this
approach, the main analysis was carried out with various choices of
the flattening (or smoothing) values. The values were either
constant (0.0001, 0.001, 0.01, 0.1, or 0.5), or uniformly
distributed random on an interval starting at 0 and with the same
expected values as above.
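
The flattening step can be sketched as follows. This is only an illustration of the idea, with a hypothetical helper name (`flatten_zeros`) and a flat list of cell frequencies as an assumed input format; the random variant draws from a uniform distribution on (0, 2 x value), which has expected value `value`.

```python
import random
from typing import List, Optional


def flatten_zeros(cells: List[float], value: float,
                  rng: Optional[random.Random] = None) -> List[float]:
    """Replace zero frequencies by a small positive flattening value.

    With rng=None every zero becomes `value`; otherwise each zero is drawn
    uniformly from (0, 2*value), whose expected value is `value`.
    """
    out = []
    for f in cells:
        if f != 0:
            out.append(float(f))
        elif rng is None:
            out.append(value)
        else:
            x = 0.0
            while x == 0.0:  # guard against an exact zero draw
                x = rng.uniform(0.0, 2.0 * value)
            out.append(x)
    return out


# Made-up frequencies for one conditional table, used only as an example.
print(flatten_zeros([0, 3, 0, 5], 0.1))
print(flatten_zeros([0, 3, 0, 5], 0.1, random.Random(1)))
```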

The estimates of $\pi^*$ for the hypotheses of no DIF and uniform DIF,
using the above flattening values, are reported in Table 1. The main
finding is that, for every choice of the flattening values, $\pi^*$(no
DIF) and $\pi^*$(uniform DIF) are very close to each other.

*** Insert Table 1 around here ***

The numerical results in Table 1 show that increases in the
flattening values result in decreases in the estimates for $\pi^*$ (for
the flattening values included). Estimates for $\pi^*$ under both
hypotheses have their minima near the flattening constant 0.9, where
the estimates are 0.06064 and 0.06057, respectively. Taking into
account, however, that several observed frequencies were equal to 0
or 1, it appears that 0.9 is too big to be used as a flattening
constant.

The results in Table 1 show that we estimate that about 7% of the
population needs to be disregarded in order to remove DIF, or about
93% of the population can be described by the model of no DIF. The
actual choice of the flattening constant has very little effect on
this result. Rudas, Clogg, and Lindsay (1994) described a method of
obtaining lower confidence bounds for $\pi^*$. With this data set, using
the flattening value of 0.1, one obtains the 95% lower confidence
bound of 0.055 (rounded value) for $\pi^*$(no DIF). As the resulting 95%
confidence interval does not contain zero, our procedure detects the
DIF present in the original data generating mechanism.



The difference

(6)    $\pi^*(\text{no DIF}) - \pi^*(\text{uniform DIF})$

can be used as a measure of the gain in fit due to using the model
of uniform DIF over the model of no DIF. This quantity compares the
estimates of the fractions of the population that cannot be
described by the respective models. Although developing a formal
test for the significance of this quantity is outside of the scope
of the present paper, the results in Table 1 suggest that there is
no substantial gain in using the model of uniform DIF to describe
the data, compared to using the model of no DIF; in both cases we
estimate that about 7% of the entire population (reference plus
focal) cannot be described by the model.

In what follows, results using the flattening constant 0.1 will be
described to illustrate the conclusions that can be reached using
the $\pi^*$ approach. The following table shows the 2x2 marginal of $\Psi$ for
the hypothesis of no DIF multiplied by the sample size. These are
the observations that have to be removed in order to achieve
conditional independence.



                         Response
                    Correct    Incorrect

Group   Reference     7.85       12.57
        Focal        14.14       35.82

These may be compared with the corresponding marginal of the
observed data:

                         Response
                    Correct    Incorrect

Group   Reference    279         221
        Focal        326         174

This shows that we estimate over 20% (35.82/174) of focal group 
members who answered the item incorrectly to be outside the model of 
no DIF, while in the other categories, the fractions are much 
smaller. The observations that were removed from among focal group 
members who answered the item incorrectly account for more than 50% 
(35.82/70.38) of the total number of observations that must be 
removed. This means that although the model of uniform DIF does not 
describe our data substantially better than the model of no DIF, the 
model of no DIF fails to account for some focal group members who 
did not answer correctly. This indicates the presence of some degree 
of DIF in favor of the reference group, as in the original mechanism 
of data generation. Note that the estimate of $\Psi$ under the hypothesis
of uniform DIF is very similar to the estimate under the hypothesis
of no DIF and has the same interpretation as above. That is, both
the magnitude of the misfit (as measured by $\pi^*$) and the pattern of
residuals are similar for the two hypotheses.
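
The fractions quoted above can be checked directly from the two marginal tables. The following worked arithmetic uses only the cell values reported there; the total agrees with the Table 1 entry of 0.07039 for the flattening value 0.1 up to rounding of the cell entries.

```latex
\begin{align*}
7.85 + 12.57 + 14.14 + 35.82 &= 70.38, &
70.38 / 1000 &\approx 0.0704, \\
35.82 / 174 &\approx 0.206, &
35.82 / 70.38 &\approx 0.509, \\
7.85 / 279 &\approx 0.028, &
12.57 / 221 &\approx 0.057, \\
14.14 / 326 &\approx 0.043.
\end{align*}
```

The last three ratios are the removed fractions in the remaining cells, each well below the roughly 21% found for focal group members who answered incorrectly.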

Under the hypothesis of uniform DIF, the value of the conditional
odds ratio for which the minimum occurred is $\alpha$(uniform DIF) = 1.09375.
There are only two types of conditional tables in which the pattern
of decreases in cell counts is different for the no DIF and uniform
DIF hypotheses: (1) tables in which one of the hypotheses holds
exactly and (2) tables in which $\alpha_j$ is between 1 and $\alpha$(uniform
DIF).



The value of $\alpha$(uniform DIF) is equal to the odds ratio that was
observed among those who had 44 correct answers. An interesting
interpretation of this value can be obtained by noting that, out of
the 1000 observations, 473 were in conditional tables where the
estimated conditional odds ratio (after replacing each zero by 0.1)
was less than 1.09375, 24 were in the conditional table where the
estimated conditional odds ratio was exactly 1.09375, and 503 came
from tables where the estimated conditional odds ratio was greater
than 1.09375. This means that the $\pi^*$ approach led to a median-type
estimate of the common conditional odds ratio.

Plotting $\Psi$ against the number-correct score may be informative in
revealing the pattern of occurrence of DIF, but, because of the
small value of $\pi^*$, we did not apply this technique here. Note that
for examinees with at least 47 correct answers, only the frequencies
of the cells with incorrect responses were reduced (under both
hypotheses).

The conventional MH DIF analysis involves calculation of the MH
chi-square and the index

       MH D-DIF = $-2.35 \ln(\alpha_{MH})$,

a transformation of the MH odds ratio estimate, $\alpha_{MH}$, to the delta
metric of item difficulty (Holland & Thayer, 1988).

For the (unsmoothed) example data, the MH chi-square statistic is
0.30, $\alpha_{MH}$ = 1.11, and MH D-DIF is -0.24, with a standard error of 0.38
(see Phillips & Holland, 1987). Since the chi-square statistic is
close to zero and MH D-DIF is close to its null value of zero, the
conclusion from the MH analysis is that there is no reason to reject
the hypothesis of no DIF. That is, the MH method fails to detect the
DIF in the population, in contrast with the $\pi^*$ approach.
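
For comparison with the sketch of the mixture index, the MH odds ratio estimate and the MH D-DIF index can be computed as below. This is a minimal sketch using the standard Mantel-Haenszel common odds ratio estimator; the function names (`mh_odds_ratio`, `mh_d_dif`) and the tuple layout of the tables are assumptions of this example, with group A taken as the reference group, and the MH chi-square and the standard error are not computed.

```python
import math
from typing import Sequence, Tuple

# Hypothetical layout of one conditional 2x2 table: (f_AC, f_AI, f_BC, f_BI),
# with A the reference group and B the focal group.
Table = Tuple[float, float, float, float]


def mh_odds_ratio(tables: Sequence[Table]) -> float:
    """Mantel-Haenszel estimate of the common conditional odds ratio."""
    num = 0.0
    den = 0.0
    for f_ac, f_ai, f_bc, f_bi in tables:
        n_j = f_ac + f_ai + f_bc + f_bi
        if n_j == 0:
            continue  # an empty level carries no information
        num += f_ac * f_bi / n_j
        den += f_ai * f_bc / n_j
    return num / den


def mh_d_dif(alpha_mh: float) -> float:
    """Transform the MH odds ratio to the delta metric: -2.35 ln(alpha_MH)."""
    return -2.35 * math.log(alpha_mh)


# Made-up frequencies, used only to exercise the functions.
toy = [(30, 20, 20, 30), (40, 10, 30, 20)]
alpha = mh_odds_ratio(toy)
print(alpha, mh_d_dif(alpha))
```

Unlike the closed-form $\pi^*$ procedure, this estimator tolerates zero cells, which is why the MH results above are reported for the unsmoothed data.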

The data for the second example were taken from the 1993 Advanced
Placement Physics B Exam. There were 70 multiple choice items and
the goal of the analysis was to detect male/female DIF. There were
data available on 9104 male (reference group) and 4118 female (focal
group) examinees. The matching variable was the number-correct score
on the 70 items. Only results for the first 10 items will be
reported here. Zero observed frequencies were replaced by 0.1, as in
the previous analysis.

*** Insert Table 2 around here ***

The results are summarized in Table 2. For the 10 items considered,
the $\pi^*$ values for the no-DIF hypothesis are between 0.02 and 0.06,
and for the uniform-DIF hypothesis between 0.02 and 0.04; i.e., we
estimate that for each item, DIF is absent in 94-98% of the
population, and uniform DIF characterizes 97-98% of the population.
The values of (6), showing the gain in fit due to assuming uniform
DIF instead of no DIF, are between 0.00 and 0.03. For items 1, 2, 5,
9, and 10, the uniform-DIF hypothesis does not fit better, as
measured by the $\pi^*$ index of fit, than the no-DIF hypothesis. The
gain is the highest for items 3 and 7, namely 3%. Whether this gain
should be considered substantial or not may depend on several
factors. One possible approach is to consider the ratio $\pi^*$(uniform
DIF)/$\pi^*$(no DIF). This shows that for items 3 and 7, the fraction of
the population not described is reduced by 50% as one moves from the
no-DIF hypothesis to the uniform-DIF hypothesis.
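
The gain (6) and the ratio just described can be tabulated from the values printed in Table 2, as in the short sketch below. The helper name `fit_gain` is hypothetical. Because Table 2 reports estimates rounded to two decimals, ratios recomputed from it only approximate those based on the unrounded estimates.

```python
from typing import List, Tuple

# (item, pi*(no DIF), pi*(uniform DIF)) as printed in Table 2 (two decimals).
table2: List[Tuple[int, float, float]] = [
    (1, 0.03, 0.03), (2, 0.03, 0.03), (3, 0.05, 0.02), (4, 0.04, 0.02),
    (5, 0.03, 0.03), (6, 0.03, 0.02), (7, 0.06, 0.03), (8, 0.04, 0.03),
    (9, 0.02, 0.02), (10, 0.02, 0.02),
]


def fit_gain(pi_no_dif: float, pi_uniform: float) -> Tuple[float, float]:
    """Return the difference (6) and the ratio pi*(uniform DIF) / pi*(no DIF)."""
    gain = round(pi_no_dif - pi_uniform, 2)  # inputs carry two decimals only
    ratio = pi_uniform / pi_no_dif
    return gain, ratio


for item, p_no, p_uni in table2:
    gain, ratio = fit_gain(p_no, p_uni)
    print(f"item {item:2d}: gain = {gain:.2f}, ratio = {ratio:.2f}")
```
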
Except for items 3, 8, and 10, the $\alpha$(uniform DIF) values suggest
superior item performance for males conditional on number-correct
score. The magnitude of DIF is greatest (above 2) for item 4.
Assuming a uniform DIF of this magnitude leads to the description
of an estimated 98% of the total population. No other assumed value
of the common conditional odds ratio could lead to the description
of a greater fraction of the population.

There are several further analyses that are facilitated by the $\pi^*$
approach. For example, in the case of item 4, DIF appears to be
concentrated at lower ability levels, and, consequently, examinees
at higher ability levels are affected by DIF to a lesser degree. It
was found that 81% of the individuals who could not be described by
the no-DIF hypothesis had number-correct scores below the median.
Ninety-five percent of those who could not be described by the
no-DIF hypothesis had number-correct scores below the 75th
percentile. The corresponding figures for item 10 are 79% and 91%,
respectively, showing again a concentration of DIF at lower ability
levels. All 10 items showed the same effect to some degree.

*** Insert Table 3 around here *** 

Results of the MH analysis are reported in Table 3. Items 3 and 8
had odds ratios less than one, indicating that females tended to
perform better, conditional on number-correct score, whereas the
other items showed better conditional item performance for males.
Using ETS criteria (Zieky, 1993), only item 4 shows substantial DIF
against females.

The analyses based on the $\pi^*$ approach and on the MH method agree
considerably as to the estimates of the common conditional odds
ratios for all the 10 items of the test considered. In the case of
item 10, the two analyses disagree concerning the direction of DIF;
however, the estimated common conditional odds ratios are close to
one in both analyses, and in the MH approach the result is not
significant. However, the strength or importance of DIF is
conceptualized in very different ways in the two approaches: the
magnitude and statistical significance of the odds ratio estimate in
the MH analysis versus the size of the fraction of the population
that cannot be described by the hypothesis of interest in the $\pi^*$
approach.



Discussion 

The $\pi^*$ approach offers a new way to assess the importance of DIF in
educational testing. The importance of DIF, in this approach, is
influenced by the size of the subgroup of the population in which
DIF may be present, as well as the magnitude of DIF for this
subpopulation. In this sense, the results of the $\pi^*$ method, when
applied to the problem of DIF, will depend to some degree on the
distribution of the observations in the reference and focal groups,
and the distribution of the matching variable. Note that the MH odds
ratios are also affected by the distribution of the examinees. The
MH odds ratio estimate can be expressed as a weighted sum of the $\alpha_j$
values, where the weights are a function of the observed
within-level cell frequencies (Holland & Thayer, 1988). In addition,
the examinee ability distribution can have unintended effects on the
MH odds ratios (Zwick, 1990).



The $\pi^*$ approach gives results with a straightforward interpretation
and may provide diagnostic information concerning the specific parts
of the population where DIF is evident. By inspecting $\Psi$, it might
be found, for example, that the lack of fit of the no-DIF hypothesis
tended to occur among examinees who chose a particular incorrect
response. This type of information could be helpful in pinpointing
the source of DIF. Or it might be found that lack of fit to the
no-DIF hypothesis occurred only among examinees in the extremes of
the test score distribution. This might be viewed as less
consequential than DIF occurring near the mean of the distribution.



Finally, it should be mentioned that recent research (Chang, Mazzeo, 
& Roussos, 1995; Roussos & Stout, 1993) has shown that under some 
circumstances, the SIBTEST method of DIF detection (Shealy & Stout,
1993) maintains better Type I error control than MH-type methods.
The $\pi^*$ approach has no intrinsic connection to the MH method and
could be applied in conjunction with SIBTEST or with other DIF
detection methods as well.

Table 1

Maximum likelihood estimates of $\pi^*$ for the hypotheses of
no DIF and uniform DIF using different flattening values
for the data generated by (5)

Flattening value    $\pi^*$(no DIF)    $\pi^*$(uniform DIF)

0.0001              0.07338            0.07274
0.001               0.07335            0.07272
0.01                0.07309            0.07243
0.1                 0.07039            0.06953
0.5                 0.06387            0.06339
U(0, 0.0002)        0.07339            0.07275
U(0, 0.002)         0.07338            0.07274
U(0, 0.02)          0.07332            0.07265
U(0, 1)             0.07071            0.06928

Table 2

Maximum likelihood estimates of
$\pi^*$(no DIF), $\pi^*$(uniform DIF), and $\alpha$(uniform DIF)
for the first 10 items of the 1993 Advanced Placement Physics B Exam

Item No    $\pi^*$(no DIF)    $\pi^*$(uniform DIF)    $\alpha$(uniform DIF)

 1         0.03               0.03                    1.03
 2         0.03               0.03                    1.08
 3         0.05               0.02                    0.63
 4         0.04               0.02                    2.08
 5         0.03               0.03                    1.24
 6         0.03               0.02                    1.28
 7         0.06               0.03                    1.62
 8         0.04               0.03                    0.87
 9         0.02               0.02                    1.15
10         0.02               0.02                    0.92

Table 3

Results of the Mantel-Haenszel analysis for the first
10 items of the 1993 Advanced Placement Physics B Exam

Item No    MH odds ratio    MH D-DIF    Standard error of MH D-DIF

 1         1.07             -0.16       0.11
 2         1.10             -0.23       0.10
 3         0.69              0.89       0.11
 4         2.00             -1.63       0.13
 5         1.02             -0.05       0.10
 6         1.16             -0.35       0.10
 7         1.49             -0.94       0.10
 8         0.87              0.32       0.10
 9         1.26             -0.54       0.12
10         1.08             -0.19       0.12